Summary

Instrumental variables have been widely used to estimate the causal effect of a treatment on
an outcome. Existing confidence intervals for causal effects based on instrumental variables
assume that all of the putative instrumental variables are valid; a valid instrumental variable is a
variable that affects the outcome only by affecting the treatment and is not related to unmeasured
confounders. However, in practice, some of the putative instrumental variables are likely to be
invalid. This paper presents a simple and general approach to construct a confidence interval that
is robust to possibly invalid instruments. The robust confidence interval has theoretical guarantees
on having the correct coverage and can also be used to assess the sensitivity of inference
when instrumental variables assumptions are violated. The paper also shows that the robust
confidence interval outperforms traditional confidence intervals popular in the instrumental variables
literature when invalid instruments are present. The new approach is applied to a developmental
economics study of the causal effect of income on food expenditures.

Some key words: Anderson-Rubin test; Confidence interval; Hypothesis testing; Invalid instrument;
Instrumental variable; Weak instrument.


1. Introduction 

The instrumental variables method is a popular method to estimate the causal effect of a treatment,
exposure, or policy on an outcome when unmeasured confounding is present (Angrist
et al., 1996; Tan, 2006; Baiocchi et al., 2014). Informally speaking, the method relies on having
instruments that are (A1) related to the exposure, (A2) only affect the outcome by affecting
the exposure (no direct effect), and (A3) are not related to unmeasured confounders that affect
the exposure and the outcome (see Section 2-2 for details). Unfortunately, in many applications,
practitioners are unsure if all of the candidate instruments satisfy these assumptions. For example,
in Mendelian randomization, the candidate instruments are genetic variants which are
associated with the exposure and may also have a direct effect on the outcome, an effect known
as pleiotropy, thereby violating (A2), or may be in linkage disequilibrium, thereby violating (A3)
(Davey Smith & Ebrahim, 2003, 2004; Lawlor et al., 2008; Burgess et al., 2012; Solovieff et al.,
2013). A similar problem arises in economics where some instruments may violate the exogeneity
assumption, which is a combination of (A2) and (A3) (Murray, 2006; Conley et al., 2012).

Violation of (A1), known as the weak instrument problem, has been studied in detail; see
Stock et al. (2002) for a survey. In contrast, the literature on violations of (A2) and (A3), known
as the invalid instrument problem (Murray, 2006), is limited. Andrews (1999) and Andrews &
Lu (2001) considered selecting valid instruments within the context of the generalized method of
moments (Hansen, 1982), but not inference on the treatment effect after selection of valid instruments.
Liao (2013) and Cheng & Liao (2015) considered estimation of the treatment effect when
there is, a priori, a known, specified set of valid instruments and another set of instruments which
may not be valid. Our work is most closely related to the recent works by Kolesár et al. (2013),
Bowden et al. (2015) and Kang et al. (2015). Kolesár et al. (2013) and Bowden et al. (2015)
considered the case when the instruments violate (A2) and proposed an orthogonality condition
where the instruments' effects on the exposure are orthogonal to their effects on the outcome.
Kang et al. (2015) considered more general violations of (A2) and (A3) based on imposing an
upper bound on the number of invalid instruments among the candidate instruments, without
knowing which instruments are valid a priori and without imposing structure on the instruments'
effects like Kolesár et al. (2013) and Bowden et al. (2015). However, Kang et al. (2015) only
studied point estimation and not confidence intervals.

This paper focuses on developing confidence intervals when candidate instruments may be
invalid, within the framework introduced by Kang et al. (2015). We propose a simple and general
confidence interval procedure that theoretically guarantees the correct coverage rate in the
presence of invalid instruments and can easily be used with traditional instrumental variables
methods. Our robust confidence interval can also be interpreted within the context of a sensitivity
analysis in causal inference where the interval measures the change in inference about the treatment
effect if instruments violate (A2) and (A3). We also provide a new theoretical framework
to study properties of confidence intervals, including size and power, under invalid instruments.
Finally, we demonstrate that our method can produce valid and informative confidence intervals in both
synthetic and real data.


2. Setup 
2-1. Notation 

We use the potential outcomes notation (Rubin, 1974) for instruments laid out in Holland
(1988). Specifically, let there be $L$ potential candidate instruments and $n$ individuals in the sample.
Let $Y_i^{(d,z)}$ be the potential outcome if individual $i$ had exposure $d$, a scalar value, and
instruments $z$, an $L$ dimensional vector. Let $D_i^{(z)}$ be the potential exposure if the individual
had instruments $z$. For each individual, we observe the outcome $Y_i$, the exposure $D_i$, and
instruments $Z_i$. In total, we have $n$ observations of $(Y_i, D_i, Z_i)$. We denote $Y = (Y_1, \ldots, Y_n)$,
$D = (D_1, \ldots, D_n)$ and $Z$ to be the $n$ by $L$ matrix where row $i$ consists of $Z_i^T$, where $Z$ is
assumed to have full rank.

For a subset $A \subseteq \{1, \ldots, L\}$, denote its cardinality $c(A)$ and its complement $A^C$. Also, let
$Z_A$ be an $n$ by $c(A)$ matrix of instruments where the columns of $Z_A$ are from the set $A$,
$P_{Z_A} = Z_A (Z_A^T Z_A)^{-1} Z_A^T$ be the orthogonal projection matrix onto the column space of $Z_A$ and $R_{Z_A}$
be the residual projection matrix so that $R_{Z_A} + P_{Z_A} = I$ where $I$ is an $n$ by $n$ identity matrix.
Also, for any $L$ dimensional vector $\pi$, let $\pi_A$ only consist of elements of the vector $\pi$ determined
by the set $A \subseteq \{1, \ldots, L\}$.


2-2. Model and definition of valid instruments 
For two possible values of the exposure $d', d$ and instruments $z', z$, we assume the following
potential outcomes model

$$Y_i^{(d',z')} - Y_i^{(d,z)} = (z' - z)^T \phi^* + (d' - d)\beta^*, \qquad E\{Y_i^{(0,0)} \mid Z_i\} = Z_i^T \psi^* \tag{1}$$

where $\phi^*$, $\psi^*$, and $\beta^*$ are unknown parameters. The parameter $\beta^*$ represents the causal parameter
of interest, the causal effect (divided by $d' - d$) of changing the exposure from $d$ to $d'$ on the
outcome. The parameter $\phi^*$ represents violation of (A2), the direct effect of the instruments on
the outcome. If (A2) holds, then $\phi^* = 0$. The parameter $\psi^*$ represents violation of (A3), the
presence of unmeasured confounding between the instrument and the outcome. If (A3) holds,
then $\psi^* = 0$.

Let $\pi^* = \phi^* + \psi^*$ and $\epsilon_i = Y_i^{(0,0)} - E\{Y_i^{(0,0)} \mid Z_i\}$. When we combine equation (1) with
the definition of $\epsilon_i$, the observed data model becomes

$$Y_i = Z_i^T \pi^* + D_i \beta^* + \epsilon_i, \qquad E(\epsilon_i \mid Z_i) = 0 \tag{2}$$

The observed model is also known as the under-identified single-equation linear model in econometrics
(page 83 of Wooldridge (2010)). The model (2) can be generalized to include heterogeneous
causal effects and non-linear effects; see Kang et al. (2015) for details. Also, the model can
incorporate exogenous covariates, say $X_i$, including an intercept term, and the Frisch-Waugh-Lovell
Theorem allows us to reduce the model with covariates to (2) (Davidson & MacKinnon,
1993). The parameter $\pi^*$ in the observed data model (2) combines both the violation of (A2),
represented by $\phi^*$, and the violation of (A3), represented by $\psi^*$. If both (A2) and (A3) are
satisfied, then $\phi^* = \psi^* = 0$ and $\pi^* = 0$. Hence, the value of $\pi^*$ captures whether instruments are
valid versus invalid. Definition 1 formalizes this idea.

Definition 1. Suppose we have $L$ candidate instruments along with the models (1)-(2). We
say that instrument $j = 1, \ldots, L$ is valid if $\pi_j^* = 0$ and invalid if $\pi_j^* \neq 0$.

When there is only one instrument, $L = 1$, Definition 1 of a valid instrument is identical to the
definition of a valid instrument in Holland (1988). Specifically, assumption (A2), the exclusion
restriction, which implies $Y_i^{(d,z)} = Y_i^{(d,z')}$ for all $d, z, z'$, is equivalent to $\phi^* = 0$, and assumption
(A3), no unmeasured confounding, which means $Y_i^{(d,z)}$ and $D_i^{(z)}$ are independent of $Z_i$ for all $d$
and $z$, is equivalent to $\psi^* = 0$, implying $\pi^* = \phi^* + \psi^* = 0$. Definition 1 is also a special case
of the definition of a valid instrument in Angrist et al. (1996) where here we assume the model
is additive, linear, and has a constant treatment effect $\beta^*$. Hence, when multiple instruments,
$L > 1$, are present, our models (1)-(2) and Definition 1 can be viewed as a generalization of the
definition of valid instruments in Holland (1988).

Let $s = 0, \ldots, L - 1$ be the number of invalid instruments and $U$ be an upper bound on $s$ plus
1, i.e. the number of invalid instruments is assumed to be less than $U$. We assume that there is at
least one valid instrument, even if we do not know which among the $L$ instruments is valid, since if all $L$
instruments are invalid (i.e. $s = L$), identification would not be possible (Kang et al., 2015).

This setup was also considered in Kang et al. (2015) as a relaxation to traditional instrumental
variables setups where one knows exactly which instruments are valid and invalid. Also, in
Mendelian randomization where instruments are genetic, the setup represents a way for a genetic
epidemiologist to impose prior beliefs about their genetic instruments' validity. For example,
based on the investigator's expertise and the genome wide association studies, the investigator
may provide an upper bound $U$, with smaller $U$s representing an investigator's confidence that
most of their $L$ instruments are valid and vice versa.

Finally, the setup can be viewed as a sensitivity analysis commonly found in causal inference.
In particular, similar to the sensitivity analysis presented in Rosenbaum (2002), we can treat $U$ as
the sensitivity parameter and vary it from $U = 1$ to $U = L$, where $U = 1$ represents the traditional
case where all instruments satisfy (A2) and (A3) and $U = L$ represents the worst case where at
most $L - 1$ instruments may violate (A2) and (A3). For each $U$, we can construct a confidence
interval and observe how violations of instrumental variables assumptions impact
the resulting conclusions about the causal effect $\beta^*$. Similar to a typical sensitivity analysis,
we can find the $U$ where our confidence interval includes the null value of $\beta^* = 0$ and therefore
invalidates the causal effect. If at $U = L$ the confidence interval does not contain the null value,
then the conclusion about the causal effect $\beta^*$ is less sensitive to violations of the instrumental
variables assumptions (A2) and (A3).
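The sensitivity analysis described above can be sketched in a few lines. This is only an illustration under stated assumptions: `ci_for_U` is a hypothetical callback, not part of the paper, that returns the robust confidence region for a given $U$ as a list of closed intervals, and `smallest_U_covering_null` is a name introduced here.

```python
from typing import Callable, List, Optional, Tuple

Interval = Tuple[float, float]


def smallest_U_covering_null(
    L: int,
    ci_for_U: Callable[[int], List[Interval]],
    null_value: float = 0.0,
) -> Optional[int]:
    """Scan U = 1, ..., L and return the first U whose region covers the null.

    ci_for_U is a hypothetical user-supplied function mapping the
    sensitivity parameter U to a list of closed intervals (lo, hi).
    Returns None when the null value is excluded for every U, i.e. the
    conclusion is insensitive to the assumed number of invalid instruments.
    """
    for U in range(1, L + 1):
        if any(lo <= null_value <= hi for lo, hi in ci_for_U(U)):
            return U
    return None


# Toy region that widens towards zero as U grows (illustrative numbers only).
def toy_region(U: int) -> List[Interval]:
    return [(1.0 - 0.3 * U, 2.0)]


print(smallest_U_covering_null(5, toy_region))
```

A returned value close to $L$ indicates that the conclusion about $\beta^*$ survives all but the most pessimistic assumptions about instrument validity.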


3. Robust Confidence Intervals With Invalid Instruments 
3-1. A general procedure 

Let $B^* \subseteq \{1, \ldots, L\}$ be the true set of invalid instruments. Then, $Z_{(B^*)^C}$ are the set of valid
instruments and $Z_{B^*}$ are exogenous covariates we adjust for in model (2). In the instrumental
variables literature, there are many test statistics $T(\beta_0, B^*)$ of the null hypothesis $H_0: \beta^* = \beta_0$
versus the alternative $H_a: \beta^* \neq \beta_0$ where $B^*$ contains invalid instruments and $(B^*)^C$ contains
valid instruments. Inverting the test $T(\beta_0, B^*)$ under size $\alpha$ provides a $1 - \alpha$ confidence interval
for $\beta^*$, denoted as $C_{1-\alpha}(Y, D, Z, B^*)$,

$$C_{1-\alpha}(Y, D, Z, B^*) = \{\beta_0 \mid T(\beta_0, B^*) \le q_{1-\alpha}\} \tag{3}$$

where $q_{1-\alpha}$ is the $1 - \alpha$ quantile of the null distribution of $T(\beta_0, B^*)$. These tests include the
two-stage least squares test, the Anderson-Rubin test (Anderson & Rubin, 1949), the conditional
likelihood ratio test (Moreira, 2003), and many others; see Supplementary Materials for details.

Unfortunately, in our problem, we do not know the true set $B^*$ of invalid instruments, so
we cannot directly use (3) to estimate confidence intervals for $\beta^*$. However, from Section
2-2, we have a constraint on the number of invalid instruments, namely $s < U$. We can use
this constraint by taking unions of $C_{1-\alpha}(Y, D, Z, B)$ over possible sets of invalid instruments
$B \subseteq \{1, \ldots, L\}$ where $c(B) < U$. The confidence interval using the true set of invalid instruments,
$C_{1-\alpha}(Y, D, Z, B^*)$, will be in this union since $c(B^*) < U$. Our proposal is exactly this, except
that we restrict the subsets $B$ to be of size $c(B) = U - 1$,

$$C_{1-\alpha}(Y, D, Z) = \bigcup_{B:\, c(B) = U - 1} C_{1-\alpha}(Y, D, Z, B) \tag{4}$$

Theorem 1 shows that the confidence interval in (4) has proper coverage in the presence of
possibly invalid instruments.

Theorem 1. Suppose model (2) holds and $s < U$. Given $\alpha$, consider any test statistic
$T(\beta_0, B)$ with the property that for any $B^* \subseteq B$, $T(\beta_0, B)$ has size at most $\alpha$ under the null
hypothesis $H_0: \beta^* = \beta_0$. Then, $C_{1-\alpha}(Y, D, Z)$ in (4) always has at least $1 - \alpha$ coverage even
with invalid instruments.

The proof is in the appendix. The proposed confidence interval $C_{1-\alpha}(Y, D, Z)$ is robust to invalid
instruments. It is also simple and general. Specifically, for any test statistic $T(\beta_0, B)$ discussed
above with a valid size for $B^* \subseteq B$, one simply takes unions of confidence intervals of $T(\beta_0, B)$
over subsets of instruments $B$ where $c(B) = U - 1$. In addition, a key feature of our procedure
is that we do not have to iterate through all subsets of instruments where $c(B) < U$; we only
have to examine the largest possible sets of invalid instruments, i.e. those subsets that are at the
upper boundary of $U$, $c(B) = U - 1$, to guarantee $1 - \alpha$ coverage.

A potential caveat to our procedure is computational feasibility. Even though we restrict the
union to subsets of exactly size $c(B) = U - 1$, if there are many candidate instruments $L$ and
$U$ is moderately large, $C_{1-\alpha}(Y, D, Z)$ becomes computationally burdensome. However, in many
instrumental variables studies, it is difficult to find good candidate instruments and rarely does the
number of these candidate instruments exceed $L = 20$, which modern computing can handle.
Hence, our procedure in (4) is computationally tractable for most practical applications.
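As a rough illustration of the union in (4) and of its computational cost, the following sketch enumerates all candidate sets of size $U - 1$ and merges the resulting intervals. It is not the authors' implementation: `ci_for_subset` is a hypothetical callback standing in for any test inversion of the form (3), and regions are represented as lists of closed intervals because an inverted test need not give a single interval.

```python
from itertools import combinations
from math import comb
from typing import Callable, List, Sequence, Tuple

Interval = Tuple[float, float]
Subset = Tuple[int, ...]


def merge_intervals(intervals: Sequence[Interval]) -> List[Interval]:
    """Return the union of closed intervals as sorted, disjoint pieces."""
    merged: List[Interval] = []
    for lo, hi in sorted(intervals):
        if merged and lo <= merged[-1][1]:
            # Overlaps (or touches) the previous piece: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def robust_ci(
    L: int,
    U: int,
    ci_for_subset: Callable[[Subset], Sequence[Interval]],
) -> List[Interval]:
    """Union over every candidate set B of invalid instruments with c(B) = U - 1.

    ci_for_subset(B) is a hypothetical user-supplied function returning the
    1 - alpha region for beta when the instruments indexed by B are treated
    as invalid (adjusted for) and the remaining ones as valid.
    """
    pieces: List[Interval] = []
    for B in combinations(range(L), U - 1):
        pieces.extend(ci_for_subset(B))
    return merge_intervals(pieces)


# Number of subsets visited for L = 10 candidate instruments and U = 5.
print(comb(10, 5 - 1))  # 210

# Toy callback (illustrative only): one interval per subset.
print(robust_ci(4, 3, lambda B: [(sum(B), sum(B) + 1.5)]))
```

The number of sets visited is $\binom{L}{U-1}$, which is 210 for $L = 10$ and $U = 5$; with $U = 1$ the only set is the empty one and the procedure reduces to the usual interval that treats every instrument as valid.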
3-2. Shorter interval with pretesting

As shown in Theorem 1, the overall interval $C_{1-\alpha}(Y, D, Z)$, which takes unions of the
confidence intervals $C_{1-\alpha}(Y, D, Z, B)$ over subsets $B$, has the desired coverage level.
Some subsets $B$ contain all the invalid instruments, leading to unbiased confidence intervals (i.e.
they contain the true $\beta^*$ with probability greater than or equal to $1 - \alpha$), while others may not include all
of the invalid instruments, leading to biased confidence intervals and, more importantly, elongating
$C_{1-\alpha}(Y, D, Z)$ to achieve the desired coverage level. In order for $C_{1-\alpha}(Y, D, Z)$ to have the
desired coverage level, we just need the union to contain one unbiased confidence interval and
ideally, we would like to remove these biased confidence intervals in the union of $C_{1-\alpha}(Y, D, Z)$
to end up with potentially shorter intervals while still maintaining the desired coverage level. In
this section, we propose a way to do this by pretesting whether each of the subsets $B^C$ in the
union of $C_{1-\alpha}(Y, D, Z)$ contains only valid instruments.

Specifically, for a $1 - \alpha$ confidence interval, consider the null hypothesis that $B^C$, for
$c(B^C) \ge 2$, contains only valid instruments, $\pi^*_{B^C} = 0$. Suppose $S(B)$ is the corresponding test
statistic for this null with level $\alpha_1 < \alpha$ and $q_{1-\alpha_1}$ is the $1 - \alpha_1$ quantile of the null distribution
of $S(B)$. Then, a $1 - \alpha$ confidence interval for $\beta^*$ that incorporates the pretest $S(B)$, denoted as
$C'_{1-\alpha}(Y, D, Z)$, is

$$C'_{1-\alpha}(Y, D, Z) = \bigcup_B \{C_{1-\alpha_2}(Y, D, Z, B) \mid c(B) = U - 1,\ S(B) \le q_{1-\alpha_1}\} \tag{5}$$

where $\alpha = \alpha_1 + \alpha_2$. For example, if the desired confidence level for $\beta^*$ is 95% where $\alpha = 0.05$,
we can set $\alpha_1 = 0.01$ and $\alpha_2 = 0.04$ where we would conduct the pretest $S(B)$ at the 0.01 level and
obtain $C_{1-\alpha_2}(Y, D, Z, B)$ at the 0.04 level. Theorem 2 shows that $C'_{1-\alpha}(Y, D, Z)$ achieves the
desired $1 - \alpha$ coverage of $\beta^*$ in the presence of possibly invalid instruments.

Theorem 2. Suppose the assumptions in Theorem 1 hold. For any pretest $S(B)$ where
$c(B^C) \ge 2$ and has the correct size under the null hypothesis that $B^C$ contains only valid
instruments, $C'_{1-\alpha}(Y, D, Z)$ always has at least $1 - \alpha$ coverage even with invalid instruments.

Similar to Theorem 1, the procedure in (5) is general in the sense that any pretest $S(B)$ with the
correct size guarantees that the confidence interval will have the desired level of coverage. For
example, the Sargan test (Sargan, 1958) can act as a pretest for (5); see Supplementary Materials
for details.
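The pretesting procedure in (5) can be sketched as follows. This is a minimal illustration rather than the authors' code: `ci_for_subset` and `pretest_pvalue` are hypothetical callbacks, and the sketch assumes the pretest has a continuous null distribution so that "do not reject at level $\alpha_1$" is the same as "p-value at least $\alpha_1$".

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

Interval = Tuple[float, float]
Subset = Tuple[int, ...]


def robust_ci_with_pretest(
    L: int,
    U: int,
    alpha: float,
    alpha1: float,
    ci_for_subset: Callable[[Subset, float], Sequence[Interval]],
    pretest_pvalue: Callable[[Subset], float],
) -> List[Interval]:
    """Union in (5): keep only the subsets B that pass the validity pretest.

    ci_for_subset(B, level) and pretest_pvalue(B) are hypothetical
    user-supplied functions. The level alpha is split as alpha1 + alpha2:
    alpha1 for the pretest and alpha2 for the interval.
    """
    if not 0.0 < alpha1 < alpha:
        raise ValueError("need 0 < alpha1 < alpha")
    if L - (U - 1) < 2:
        # The pretest needs at least two instruments in the complement of B.
        raise ValueError("need c(B^C) = L - (U - 1) >= 2")
    alpha2 = alpha - alpha1
    kept: List[Interval] = []
    for B in combinations(range(L), U - 1):
        if pretest_pvalue(B) >= alpha1:  # pretest does not reject validity of B^C
            kept.extend(ci_for_subset(B, alpha2))
    return sorted(kept)
```

Subsets whose complement is rejected as containing invalid instruments are dropped from the union, which is what shortens the interval relative to (4).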


3-3. Power under invalid instruments 

While many tests will satisfy the requirements for Theorems 1 and 2, some tests will be better
than others where "better" can be defined in terms of statistical power, length of the confidence
interval, or coverage, all in the presence of invalid instruments. In this section, we provide one
framework to study the power of common tests in the instrumental variables literature under invalid
instruments as they are applied to our robust confidence intervals. Section 4 studies properties of
tests with respect to confidence interval coverage and length.

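To set up this framework, we follow the weak instrument literature mentioned in Section 1
where tests under violation of (A1) were studied in great detail (Staiger & Stock, 1997; Andrews
et al., 2006, 2007), except we modify it to study violations of (A2) and (A3).

$$Y_i = Z_i^T \pi^* + D_i \beta^* + \epsilon_i, \qquad E(\epsilon_i, \xi_i \mid Z_i) = 0 \tag{6a}$$

$$D_i = Z_i^T \gamma^* + \xi_i \tag{6b}$$

$$\begin{pmatrix} \epsilon_i \\ \xi_i \end{pmatrix} \sim N\left\{ \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix} \right\} \tag{6c}$$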
The setup in (6) is a special case of model (2) with the additional assumptions that (i) $D_i$ is
related linearly to $Z_i$ and (ii) the error terms are i.i.d. bivariate Normal with an
unknown covariance matrix. Also, the two major differences between the setup considered in the
weak instrumental variables work of Staiger & Stock (1997) and Andrews et al. (2007) and the
setup here are the introduction of the term $Z_i^T \pi^*$ to model invalid instruments and a fixed $\gamma^*$.

Under the model in (6), we study whether a particular test has the power to detect the
alternative $H_a: \beta^* \neq \beta_0$ or $\pi^*_{B^C} \neq 0$ against the null $H_0: \beta^* = \beta_0$ and $\pi^*_{B^C} = 0$ for a given set $B$. The
first alternative is the power to detect $\beta^* \neq \beta_0$. The second alternative, perhaps a more important
alternative in the case of invalid instruments, is the power to detect the wrong $B$s in the union
of $C_{1-\alpha}(Y, D, Z)$. A wrong $B$ is one where $B$ does not contain all the invalid instruments so that
$\pi^*_{B^C} \neq 0$. If a test has good power against the second alternative, we would be less likely to take
the unions over wrong $B$s in $C_{1-\alpha}(Y, D, Z)$ and our robust confidence interval will tend to be
shorter. We refer to analyzing power of tests under these two alternatives as power under invalid
instruments.

Under this framework of power under invalid instruments, we can study the power of common test statistics in the
instrumental variables literature. As an illustrative example, we study the power of the Anderson-Rubin
test (Anderson & Rubin, 1949); see the supplementary materials for additional analysis of
different test statistics. The Anderson-Rubin test is a popular test in instrumental variables based
on the partial F-test of the regression coefficients between $Y - D\beta_0$ versus $Z_{B^C}$, i.e.

$$AR(\beta_0, B) = \frac{(Y - D\beta_0)^T (P_Z - P_{Z_B})(Y - D\beta_0) / \{L - c(B)\}}{(Y - D\beta_0)^T R_Z (Y - D\beta_0) / (n - L)} \tag{7}$$


The Anderson-Rubin test has some attractive properties, including robustness to weak instruments
(i.e. violation of (A1)) (Staiger & Stock, 1997), a simple formula, robustness to first-stage
modeling assumptions, an exact null distribution under Normality, and various others; see Dufour
(2003) for details. A caveat to the Anderson-Rubin test is its conservative power compared to a few
recent tests (Andrews et al., 2006; Mikusheva, 2010), such as the conditional likelihood ratio test
(Moreira, 2003), under weak instruments. However, the Anderson-Rubin test has the feature that
it can be used as a pretest to check whether the candidate subset of instruments $B$ contains all the
invalid instruments (Kleibergen, 2007). This feature is particularly useful for our problem where
we have possibly invalid instruments. Indeed, as we will show in Theorem 3 and in Section 4,
contrary to the weak instrument literature, the Anderson-Rubin test actually performs better than
the conditional likelihood ratio test under invalid instruments.
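To make the statistic in (7) concrete, here is a small pure-Python sketch. It relies on the identity that $u^T (P_Z - P_{Z_B}) u$ is the drop in residual sum of squares when $Z_{B^C}$ is added to a regression of $u = Y - D\beta_0$ on $Z_B$, and that $u^T R_Z u$ is the residual sum of squares of the full regression. The function names are introduced here for illustration, instruments are passed as a list of columns, and the sketch assumes $Z$ has full column rank and that any intercept or covariates have already been partialled out.

```python
from typing import List, Sequence


def _solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / m[r][r]
    return x


def rss(u: Sequence[float], cols: Sequence[Sequence[float]]) -> float:
    """Residual sum of squares from least squares of u on the given columns."""
    if not cols:
        return sum(v * v for v in u)
    gram = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    rhs = [sum(p * q for p, q in zip(ci, u)) for ci in cols]
    coef = _solve(gram, rhs)
    fitted = [sum(coef[k] * cols[k][i] for k in range(len(cols)))
              for i in range(len(u))]
    return sum((v - f) ** 2 for v, f in zip(u, fitted))


def anderson_rubin(y, d, z_cols, B, beta0):
    """AR(beta0, B) as in (7); B indexes the columns treated as invalid."""
    n, L = len(y), len(z_cols)
    u = [yi - beta0 * di for yi, di in zip(y, d)]
    rss_b = rss(u, [z_cols[j] for j in B])   # u' R_{Z_B} u
    rss_full = rss(u, list(z_cols))          # u' R_Z u
    return ((rss_b - rss_full) / (L - len(B))) / (rss_full / (n - L))


# Toy data (illustrative only): two orthogonal instruments, six observations.
z = [[1, -1, 0, 0, 0, 0], [0, 0, 1, -1, 0, 0]]
y = [1, -1, 2, -2, 2, 2]
d = [0, 0, 0, 0, 1, 1]
print(anderson_rubin(y, d, z, (), 1.0))    # both instruments treated as valid
print(anderson_rubin(y, d, z, (0,), 1.0))  # first instrument treated as invalid
```

Under Normality and the null, the statistic is compared with the $1 - \alpha$ quantile of the F distribution with $L - c(B)$ and $n - L$ degrees of freedom.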


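Theorem 3. Consider any set $B \subseteq \{1, \ldots, L\}$ with $c(B) = U - 1$ and the null hypothesis
$H_0: \beta^* = \beta_0$ and $B^C$ contains valid instruments versus the alternative $H_a: \beta^* \neq \beta_0$ or $B^C$
contains some invalid instruments. Under the data generating model in (6), the exact power of
$AR(\beta_0, B)$ under invalid instruments is

$$\mathrm{pr}\{AR(\beta_0, B) > q^{1-\alpha}_{L-c(B),\,n-L,\,0}\} = 1 - F_{L-c(B),\,n-L,\,\eta(B)}\left(q^{1-\alpha}_{L-c(B),\,n-L,\,0}\right) \tag{8}$$

where $F_{L-c(B),\,n-L,\,\eta(B)}$ and $q^{1-\alpha}_{L-c(B),\,n-L,\,\eta(B)}$ are the distribution function and the $1 - \alpha$ quantile of
the non-central F distribution with degrees of freedom $L - c(B)$, $n - L$ and non-centrality
$\eta(B) = \| R_{Z_B} Z_{B^C} \{\pi^*_{B^C} + \gamma^*_{B^C}(\beta^* - \beta_0)\} \|_2^2$.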


The power of the Anderson-Rubin test under invalid instruments is a generalization of the
power of the Anderson-Rubin test when all the instruments are valid. Specifically, if $B$ contains
all the invalid instruments so that $B^* \subseteq B$, then $\pi^*_{B^C} = 0$ and the non-centrality parameter
$\eta(B)$ in Theorem 3 would consist of only the strength of the instruments, specifically
$\| R_{Z_B} Z_{B^C} \gamma^*_{B^C}(\beta^* - \beta_0) \|_2^2$, and we would return to the usual power of the Anderson-Rubin
test with all valid instruments. On the other hand, if $B$ does not contain all invalid instruments
so that $B^* \not\subseteq B$, then $\pi^*_{B^C} \neq 0$ and the Anderson-Rubin test will still have power, even if $\beta^* = \beta_0$.
That is, the Anderson-Rubin test will reject $H_0$ and will generally have shorter intervals when $B$
does not contain all the invalid instruments; see Section 4 for empirical verification of this
phenomenon. Also, the Anderson-Rubin test has no power when $\pi^*_{B^C} + \gamma^*_{B^C}(\beta^* - \beta_0) = 0$; a similar
result was shown in Kadane & Anderson (1977) and Small (2007) when studying the power of
overidentifying restrictions tests. Finally, we note that our power formula is exact and does not
invoke asymptotics.
While there are many other ways to study the power of a test under invalid instruments, we
believe the framework, based partly on the weak instrument literature, provides a first-step
approximation to the behavior of common tests under invalid instruments. The supplementary
materials elaborate on this by (i) showing that the framework is a decent approximation to
the invalid instrument phenomenon, (ii) providing an asymptotic version of this framework under
invalid instruments, (iii) analyzing other test statistics under this framework, including the
two-stage least squares test, and (iv) studying the resulting power of the confidence interval
$C_{1-\alpha}(Y, D, Z)$ where we take unions over all sets $B$.

4. Simulation Study With Invalid Instruments 
4-1. Coverage 

In this simulation study, we evaluate the robustness of our method compared to popular methods
for confidence intervals in the instrumental variables literature when there are invalid instruments.
The simulation setup follows (6) with $n = 5000$ individuals and $L = 10$ candidate instruments
where each pair of instruments is correlated with correlation 0.6. We fix the parameters $\beta^* = 2$,
$\sigma_2 = \sigma_1 = \rho = 0.8$, and $U = 5$ (i.e. fewer than 50% of the instruments are invalid). We vary the
parameters $\pi^*$ and $\gamma^*$ as follows. We vary the size of the support of $\pi^*$ from 0 to 4 to represent invalid
instruments, and the elements of $\pi^*$ in the support are drawn from a uniform distribution. For
$\gamma^*$, we set it to two values that correspond to the concentration parameters 100 and 5 to represent
strong and weak instruments. The concentration parameter is the expected value of the F statistic
for the coefficients in the regression of $D$ on $Z$ and is a measure of instrument strength
(Stock et al., 2002). For each simulation setting, we generate 5000 simulated data sets.

For each simulation setting, we compare our methods in (4) and (5) to "naive" and "oracle"
methods where "naive" methods assume all candidate instruments are valid, which is typically
done in practice, and "oracle" methods assume one knows exactly which instruments are valid
and invalid, i.e. $B^*$ is known, and use (3). Note that the oracle methods are not practical because
of the incomplete knowledge about exactly which instruments are invalid versus valid. We use the
following test statistics popular in the instrumental variables literature: the two-stage least squares
test, the Anderson-Rubin test in (7), and the conditional likelihood ratio test (Moreira, 2003).
Also, for our methods involving pretests in (5), we use the Sargan test as the pretest for the
two-stage least squares test and the conditional likelihood ratio test, both at level $\alpha_1 = 0.01$
for the pretest and $\alpha_2 = 0.04$ for the subsequent tests. For simulation settings involving weak
instruments, we only compare between tests that are robust to weak instruments, specifically
the Anderson-Rubin test and the conditional likelihood ratio test, and we do not use the
pretesting method in (5), which uses the Sargan test and is known to perform poorly with weak
instruments (Staiger & Stock, 1997); see the Supplementary Materials for details about the tests
and additional details about the simulation setting.

Table 1 shows the empirical coverage proportion of different methods when we vary $s$ and the
strength of the instruments. When there are no invalid instruments, $s = 0$, the naive and oracle
procedures have the desired 95% coverage for both strong and weak instruments. Our methods
have higher than 95% coverage because they do not assume that all candidate instruments are
valid. As the number of invalid instruments, $s$, increases, the naive methods fail to have any
coverage regardless of the strength of the instruments. Our methods, in contrast, have the desired
level of coverage, with the coverage level reaching nominal levels when $s$ is at the boundary
of $s < U$, i.e. $s = 4$, all without knowing which instruments are valid or invalid a priori. The
oracle methods have coverage reaching nominal levels since they know which instruments are
valid and invalid. Finally, we note that under the worst case, where the instruments are weak
and $0 < s < U$ so that essentially all three instrumental variables assumptions (A1)-(A3) are
violated, our methods still provide honest coverage and, when $s = 4$, reach oracle coverage.

Table 1. Comparison of coverage (%) between 95% confidence intervals

Strength  Case        Test        s = 0  s = 1  s = 2  s = 3  s = 4
Strong    Naive       TSLS           94      0      0      0      0
                      AR             95      0      0      0      0
                      CLR            95      0      0      0      0
          Our method  TSLS          100    100    100    100     94
                      AR            100    100    100    100     95
                      CLR           100    100    100    100     98
                      SAR + TSLS    100    100    100    100     95
                      SAR + CLR     100    100    100    100     96
          Oracle      TSLS           94     94     94     94     94
                      AR             95     95     95     95     95
                      CLR            95     95     95     95     96
Weak      Naive       AR             95      0      0      0      0
                      CLR            95      0      0      0      0
          Our method  AR            100    100    100    100     95
                      CLR           100    100    100    100     95
          Oracle      AR             95     95     95     95     95
                      CLR            95     95     95     95     95

TSLS, two-stage least squares; AR, Anderson-Rubin test; CLR, conditional likelihood ratio
test; SAR, Sargan test. There are L = 10 candidate instruments and U = 5. Strong instruments
and weak instruments correspond to concentration parameters equaling 100 and 5,
respectively. The standard errors for all the coverage proportions do not exceed 1%.

In short, in the presence of possibly invalid instruments, the naive, popular approach of simply
assuming all the instruments are valid would lead to misleading inference. In contrast, our methods
provide honest coverage regardless of whether instruments are invalid or valid and should
be used whenever there is concern for possibly invalid instruments. In particular, (4) works
regardless of the strength of the instruments while our method in (5) provides a desired level of
coverage so long as the instruments are strong.

4-2. Length and power 

Next, following Section 3-3, we compare tests that are used in our methods to see which tests
are "better" with regard to the length and power of the corresponding confidence interval. The
simulation settings are identical to Section 4-1, except we only compare between our methods
and the oracles. Table 2 examines the median length of the 95% confidence intervals.

For strong instruments, our method and the oracles become similar as the number of invalid
instruments $s$ grows, with the Anderson-Rubin test and methods with pretesting achieving oracle
performance at $s = 4$ while the two-stage least squares and conditional likelihood ratio tests, both
without pretesting, do not reach oracle performance at $s = 4$. The improved performance with
pretesting is expected since pretesting was introduced in (5) to avoid taking unnecessary unions of
intervals. For weak instruments, our method produces infinite intervals with the exception
of the Anderson-Rubin test at $s = 4$. The infinite lengths suggest that weak instruments can
greatly amplify the bias caused by invalid instruments, thereby forcing our robust methods to
produce infinite intervals to retain honest coverage, something that has been observed in previous
studies (Small & Rosenbaum, 2008). In contrast, the oracle methods produce finite intervals
since instrumental validity is not an issue, although if the instrument is arbitrarily weak, infinite
confidence intervals are necessary (Dufour, 1997).

Table 2. Comparison of median lengths between 95% confidence intervals

Strength  Case        Test        s = 0  s = 1  s = 2  s = 3   s = 4
Strong    Our method  TSLS         0.93   2.63   2.08   3.62    5.12
                      AR           1.63   0.77   0.51   0.36    0.24
                      CLR          1.09  12.42   8.89   9.44    8.02
                      SAR + TSLS   0.95   0.54   0.37   0.26    0.17
                      SAR + CLR    1.12   0.58   0.38   0.27    0.17
          Oracle      TSLS         0.12   0.13   0.14   0.15    0.16
                      AR           0.20   0.21   0.21   0.22    0.24
                      CLR          0.12   0.13   0.14   0.15    0.16
Weak      Our method  AR            inf    inf    inf    inf  573.01
                      CLR           inf    inf    inf    inf     inf
          Oracle      AR           1.02   1.07   1.13   1.19    1.27
                      CLR          0.62   0.65   0.70   0.75    0.82

TSLS, two-stage least squares; AR, Anderson-Rubin test; CLR, conditional likelihood ratio
test; SAR, Sargan test; inf, infinite length. There are L = 10 candidate instruments and U = 5. Strong instruments
and weak instruments correspond to concentration parameters equaling 100 and 5, respectively.
The interquartile ranges of all strong intervals do not exceed 5.77. The interquartile ranges of
weak intervals do not exceed 1.00 for non-infinite intervals except when s = 4.

Figures 1 and 2 compares the power of the tests under strong and weak instruments, respec¬ 
tively. Each column represents each tests while each row represents different s. 

For strong instruments in Figure 1, all our methods’ power curves are dominated by the oracles’
power curves, which is to be expected since the oracles know exactly which instruments are
valid and invalid. Similar to what we observed with confidence interval length, the gap between
our methods and the oracle shrinks as s grows. Also, two-stage least squares and the conditional
likelihood ratio test without pretesting have no power for the positive side of the alternative close
to zero, while the pretesting versions provide power in this region. The Anderson-Rubin test,
which does not require a pretest, has power on both sides of the alternative and, at s = 4, it
achieves oracle power with perfectly overlapping curves.

For weak instruments in Figure 2, again, our methods’ power curves are dominated by the
oracles’ power curves. Between the two tests for our methods, the Anderson-Rubin test performs
no worse than the conditional likelihood ratio test, with the Anderson-Rubin test dominating
the conditional likelihood ratio test for s = 4. This finding contrasts with the advice in the weak
instruments literature, where the conditional likelihood ratio test generally dominates the
Anderson-Rubin test (Andrews et al., 2006; Mikusheva, 2010); the weak instrument setting here
suggests that the intuition from the weak instrument literature with regard to power does not
necessarily translate if invalid instruments are present.
[Figure 1. Panels: Anderson-Rubin, Conditional Likelihood Ratio, Two-Stage Least Squares. Horizontal axis: $\beta - \beta_0$, from -0.3 to 0.3.]

Fig. 1. Power curves of our methods and the oracle methods for strong instruments. Solid lines are the
oracle methods, dashed lines are our methods, and dotted lines are the pretesting methods. Each column
corresponds to a test statistic and each row to a different s.


[Figure 2. Panels: Anderson-Rubin, Conditional Likelihood Ratio. Horizontal axis: $\beta - \beta_0$, from -1.0 to 1.0.]

Fig. 2. Power curves of our methods and the oracle methods for weak instruments. Circles represent
the Anderson-Rubin test, triangles represent the conditional likelihood ratio test, and squares represent
two-stage least squares. Solid lines represent oracle methods. Dashed lines represent our methods. Dotted
lines represent pretesting methods.













In short, the simulation study here suggests that (i) there may not be a uniformly dominating method
or test to use with our methods, (ii) the performance of the methods depends on s and the strength
of the instruments, and (iii) invalid instruments may exacerbate the bias from weak instruments,
so that traditionally strong performers in weak instrument settings, like the conditional likelihood
ratio test, may perform poorly compared to the Anderson-Rubin test, traditionally a weak performer
with weak instruments. In practice, when one has strong instruments, the Anderson-Rubin
test and the pretesting method with two-stage least squares or the conditional likelihood ratio test
perform well with respect to power and length, with the Anderson-Rubin test being the simpler one
because it needs no pretest and the pretesting method providing somewhat shorter intervals. For weak
instruments, the Anderson-Rubin test performs the best. Indeed, in this worst case where the
instruments are weak and there are invalid instruments (i.e. all instrumental variables assumptions
are violated), all our procedures lead to infinite, but honest, intervals. An infinite interval
can be disappointing at first, but it is actually informative in the sense that the quality of the
data is insufficient to draw any meaningful conclusion about the treatment effect $\beta^*$.


5. Data Analysis: Demand for Food in Developmental Economics 

In developmental economics, there is great interest in studying relationships between income
and expenditures on goods and services to better understand welfare policies and economic
activity in developing nations (Deaton, 1997). There is a long-held hypothesis, loosely called
the efficient wage hypothesis, which suggests that raising income would lead to workers being
better fed, which would in turn lead to better productivity in the workforce, especially in
developing economies (Mirrlees, 1975; Stiglitz, 1976; Subramanian & Deaton, 1996). Consequently, a
strong focus in this literature is the relationship between income and calorie intake
(Reutlinger & Selowsky, 1976; Behrman & Deolalikar, 1987; Ravallion, 1990; Bouis & Haddad,
1990, 1992; Subramanian & Deaton, 1996; Dawson & Tiffin, 1998; Tiffin & Dawson, 2002;
Abdulai & Aubert, 2004).

To this end, we reanalyze the instrumental variables analysis done in Bouis & Haddad (1990)
and Bouis & Haddad (1992), where the goal was to analyze the effect of income on demand for
food among n = 405 Philippine farm households. Specifically, the outcome is household food
expenditures, $Y_i$. The exposure is the household’s log income, $D_i$. We use four candidate
instruments: cultivated area per capita, $Z_{i1}$, worth of assets, $Z_{i2}$, a binary indicator of
electricity at the household, $Z_{i3}$, and quality of flooring at the house, $Z_{i4}$. Page 82 of
Bouis & Haddad (1990) states that the reasoning behind proposing these variables as instrumental
variables is that “land availability is assumed to be a constraint in the short run, and therefore
exogenous to the household decision making process.” We also control for the measured covariates,
which are mother’s education, father’s education, mother’s age, father’s age, mother’s nutritional
knowledge, price of corn, price of rice, population density of the municipality, and number of
household members in adult equivalents; see Bouis & Haddad (1990) and Bouis & Haddad (1992) for
further details on the data.

The F-statistic for instrument strength is 103·77, indicating reasonably strong instruments. The
Sargan test for overidentification, which tests assumptions (A2) and (A3), produces a p-value of
0·079. Even though the p-value is low, practitioners of the instrumental variables method would
usually assume that (A2) and (A3) hold since the p-value is above 0·05, the typical significance
threshold, and use one of the four naive methods in Section 4 to obtain confidence intervals; see
the column U = 1 in Table 3, in particular the rows with Sargan pretesting, for examples of this
type of analysis. In contrast, our methods do not take for granted that the four instruments are
valid and allow the instruments to be invalid. In particular, we vary U away from 1 until the
confidence intervals from our methods contain the null value 0, using the three tests in
Section 4·1. For procedures with pretests, we used the same $\alpha_1$ and $\alpha_2$ as in Section 4·1.

Table 3. 95% Confidence Interval of Income’s Effect on Food Expenditures

Test          U = 1 (Naive)     U = 2             U = 3
TSLS          (0·043, 0·053)    (0·031, 0·059)    (-0·017, 0·064)
AR            (0·044, 0·054)    (0·037, 0·058)    (-0·027, 0·068)
CLR           (0·043, 0·055)    (0·034, 0·066)    (-0·042, 0·070)
SAR + TSLS    (0·042, 0·054)    (0·031, 0·059)    (-0·018, 0·065)
SAR + CLR     (0·043, 0·055)    (0·034, 0·067)    (-0·049, 0·072)

TSLS, two-stage least squares; AR, Anderson-Rubin test; CLR, conditional
likelihood ratio test; SAR, Sargan test. There are four candidate instruments.

The empirical findings are summarized in Table 3. Even if one instrument is invalid,
there is a significant effect of income on food expenditures: at U = 2, none of the intervals
contains the null value zero. However, if more than one instrument may be invalid, U > 2, all the
intervals contain the null value zero and the causal effect is no longer significant. We also note
that our method in (4) with the Anderson-Rubin test, or the pretesting method with two-stage least
squares, provides the shortest intervals.
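
The reading of Table 3 can be checked with simple arithmetic. The following minimal Python sketch transcribes the intervals from Table 3 and reports, for each test, the smallest U at which the interval first covers zero, together with the interval lengths. The dictionary layout and the function names are illustrative assumptions and are not part of the original analysis.

```python
# 95% confidence intervals transcribed from Table 3, keyed by test and by U.
TABLE_3 = {
    "TSLS":       {1: (0.043, 0.053), 2: (0.031, 0.059), 3: (-0.017, 0.064)},
    "AR":         {1: (0.044, 0.054), 2: (0.037, 0.058), 3: (-0.027, 0.068)},
    "CLR":        {1: (0.043, 0.055), 2: (0.034, 0.066), 3: (-0.042, 0.070)},
    "SAR + TSLS": {1: (0.042, 0.054), 2: (0.031, 0.059), 3: (-0.018, 0.065)},
    "SAR + CLR":  {1: (0.043, 0.055), 2: (0.034, 0.067), 3: (-0.049, 0.072)},
}


def contains_zero(interval):
    """True when the null value 0 lies inside the interval."""
    lower, upper = interval
    return lower <= 0.0 <= upper


def length(interval):
    """Length of a finite interval."""
    return interval[1] - interval[0]


def first_u_covering_zero(intervals_by_u):
    """Smallest U whose interval contains 0, or None if no interval does."""
    for u in sorted(intervals_by_u):
        if contains_zero(intervals_by_u[u]):
            return u
    return None


for test, by_u in TABLE_3.items():
    lengths = {u: round(length(ci), 3) for u, ci in by_u.items()}
    print(test, "| first U covering 0:", first_u_covering_zero(by_u), "| lengths:", lengths)
```

With the transcribed values, every test first covers zero at U = 3. At U = 2 the Anderson-Rubin interval is the shortest (length 0·021), while at U = 3 the two-stage least squares interval is the shortest (length 0·081).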

The empirical application illustrates the usefulness of our procedure whenever there is a concern
about invalid instruments. In particular, our procedure is a simple modification of pre-existing
procedures for instrumental variables that yields honest confidence intervals and can act as a
sensitivity analysis for violation of IV assumptions (A2) and (A3). The supplementary materials
provide additional analysis of these data to further highlight the strengths of our confidence
interval approach in practice.


6. Discussion 

This paper proposes a simple and general method to construct robust confidence intervals for
causal effects using instrumental variables estimates when the instruments are possibly invalid,
with theoretical guarantees with respect to coverage. We propose two methods in (4) and (5) that
are simple modifications of pre-existing methods in instrumental variables and that protect against
invalid instruments. Our data analysis example illustrates that our method can be a simple, robust
alternative to existing confidence intervals, with the proper coverage whenever there is concern
about possibly invalid instruments, and can assess the sensitivity of inference to violations of IV
assumptions.


Supplementary material 

Supplementary material available at Biometrika online includes theoretical details, additional
simulations and empirical analysis.


Appendix 1 
A-1. Proofs 

Proof of Theorem 1. By $s = c(B^*) < U$, there is a subset $B$ where $c(B) = U - 1$ and $B^* \subseteq B$. Also,
its complement $B^C$ only contains valid instruments and thus, $\mathrm{pr}\{\beta^* \in C_{1-\alpha}(Y, D, Z, B)\} \ge 1 - \alpha$.
Hence, we have

$$\mathrm{pr}\{\beta^* \in C_{1-\alpha}(Y, D, Z)\} \ge \mathrm{pr}\{\beta^* \in C_{1-\alpha}(Y, D, Z, B)\} \ge 1 - \alpha$$

for all values of $\beta^*$. □

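Proof of Theorem 2. Similar to the proof of Theorem 1, a set $B$ that contains all invalid
instruments has to exist and, additionally, has the property $\mathrm{pr}\{S(B) > q_{1-\alpha_1}\} \le \alpha_1$. Then, we can use
Bonferroni’s inequality to obtain

$$
\begin{aligned}
\mathrm{pr}\{\beta^* \in C'_{1-\alpha}(Y, D, Z)\} &\ge \mathrm{pr}[\{\beta^* \in C_{1-\alpha_2}(Y, D, Z, B)\} \cap \{S(B) \le q_{1-\alpha_1}\}] \\
&\ge 1 - \mathrm{pr}\{\beta^* \notin C_{1-\alpha_2}(Y, D, Z, B)\} - \mathrm{pr}\{S(B) > q_{1-\alpha_1}\} \\
&\ge 1 - \alpha_1 - \alpha_2 = 1 - \alpha,
\end{aligned}
$$

thereby guaranteeing the correct coverage. □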
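
The two coverage arguments above rely on the robust interval containing the interval for at least one subset $B$ with $c(B) = U - 1$ that covers all invalid instruments. A minimal Python sketch of such a union construction follows, assuming that (4) takes the union over all subsets of size $U - 1$ and that (5) restricts the union to subsets passing the pretest. The function `robust_interval` and the callbacks `interval_fn` and `pretest_fn` are hypothetical names introduced for illustration; any test-based interval and pretest could be plugged in.

```python
from itertools import combinations


def robust_interval(num_instruments, upper_bound, interval_fn, pretest_fn=None):
    """Union of intervals over all subsets B with c(B) = U - 1.

    interval_fn(B) is a hypothetical callback returning (lower, upper), the
    interval computed when the instruments indexed by B may be invalid.
    pretest_fn(B), if given, is a hypothetical callback returning True when
    B passes the pretest; subsets that fail the pretest are skipped.
    Returns a sorted list of disjoint (lower, upper) pieces.
    """
    pieces = []
    for subset in combinations(range(num_instruments), upper_bound - 1):
        if pretest_fn is not None and not pretest_fn(subset):
            continue
        pieces.append(interval_fn(subset))

    # Merge overlapping pieces so the union is reported as disjoint intervals.
    merged = []
    for lower, upper in sorted(pieces):
        if merged and lower <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], upper))
        else:
            merged.append((lower, upper))
    return merged


# Toy example with L = 3 instruments and U = 2 (made-up intervals).
toy = {(0,): (0.0, 1.0), (1,): (0.5, 1.5), (2,): (3.0, 4.0)}
print(robust_interval(3, 2, toy.__getitem__))
print(robust_interval(3, 2, toy.__getitem__, pretest_fn=lambda B: B != (2,)))
```

In the toy example the pretest removes the subset whose interval is far from the others, which shortens the union; this mirrors the motivation for pretesting given in the simulation study. With U = 1 the only subset is the empty set, so the construction reduces to the naive interval.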


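Proof of Theorem 3. By Cochran’s theorem, (i) the numerator and the denominator of (7) are independent,
(ii) the denominator, scaled by $\sigma^2 = \sigma_1^2 + (\beta^* - \beta_0)^2 \sigma_2^2 + 2(\beta^* - \beta_0)\rho\sigma_1\sigma_2$, is a central chi-square
with $n - L$ degrees of freedom, i.e.

$$\frac{(Y - D\beta_0)^T R_Z (Y - D\beta_0)}{\sigma^2 (n - L)} = \frac{\{(\beta^* - \beta_0)\xi + \epsilon\}^T R_Z \{(\beta^* - \beta_0)\xi + \epsilon\}}{\sigma^2 (n - L)} \sim \frac{\chi^2_{n-L,0}}{n - L},$$

and (iii) the numerator, scaled by $\sigma^2$, is a non-central chi-square distribution with non-centrality $\eta(B)$, i.e.

$$\frac{(Y - D\beta_0)^T (P_Z - P_{Z_B})(Y - D\beta_0)}{\sigma^2 \{L - c(B)\}} \sim \frac{\chi^2_{L-c(B),\eta(B)}}{L - c(B)}, \qquad \eta(B) = \frac{\|(P_Z - P_{Z_B}) Z \{\pi^* + \gamma^*(\beta^* - \beta_0)\}\|_2^2}{\sigma^2}.$$

Since $(P_Z - P_{Z_B})Z$ can be rewritten as the residual projection of $Z$ onto $Z_B$, i.e.

$$(P_Z - P_{Z_B})Z = Z - [Z_B : P_{Z_B} Z_{B^C}] = [0 : Z_{B^C} - P_{Z_B} Z_{B^C}] = [0 : R_{Z_B} Z_{B^C}] = R_{Z_B} Z,$$

the statistic $\mathrm{AR}(\beta_0, B)$ follows a non-central F distribution with $L - c(B)$ and $n - L$ degrees of freedom. □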
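
To make the quadratic forms in the proof concrete, the following Python sketch computes a statistic of the form used above: the numerator is $(Y - D\beta_0)^T (P_Z - P_{Z_B})(Y - D\beta_0)$ divided by $L - c(B)$ and the denominator is $(Y - D\beta_0)^T R_Z (Y - D\beta_0)$ divided by $n - L$. This is a sketch that assumes (7) has exactly this form and that there are no additional covariates to partial out; the helper names `dot`, `project` and `anderson_rubin` are illustrative. Only the standard library is used, with projections computed by Gram-Schmidt orthogonalization.

```python
from math import sqrt


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project(columns, y):
    """Orthogonal projection of y onto the span of the given column vectors."""
    basis = []
    for col in columns:
        w = list(col)
        for q in basis:  # Gram-Schmidt: remove components along earlier directions
            c = dot(q, w)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = sqrt(dot(w, w))
        if norm > 1e-10:  # skip linearly dependent columns
            basis.append([wi / norm for wi in w])
    fitted = [0.0] * len(y)
    for q in basis:
        c = dot(q, y)
        fitted = [fi + c * qi for fi, qi in zip(fitted, q)]
    return fitted


def anderson_rubin(Y, D, Z_columns, B, beta0):
    """Statistic AR(beta0, B); Z is a list of L columns, B a set of column indices."""
    n, L = len(Y), len(Z_columns)
    r = [y - d * beta0 for y, d in zip(Y, D)]            # Y - D * beta0
    fit_Z = project(Z_columns, r)                         # P_Z r
    fit_ZB = project([Z_columns[j] for j in B], r)        # P_{Z_B} r
    resid = [ri - fi for ri, fi in zip(r, fit_Z)]         # R_Z r
    # r'(P_Z - P_{Z_B})r = |P_Z r|^2 - |P_{Z_B} r|^2 since span(Z_B) lies in span(Z)
    numerator = (dot(fit_Z, fit_Z) - dot(fit_ZB, fit_ZB)) / (L - len(B))
    denominator = dot(resid, resid) / (n - L)
    return numerator / denominator


# Tiny made-up example: n = 4, L = 2 orthogonal instruments, B = {0}.
Y = [1.0, 2.0, 3.0, 4.0]
D = [1.0, 0.0, 0.0, 0.0]
Z = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
print(anderson_rubin(Y, D, Z, {0}, beta0=1.0))
```

In the example, $Y - D\beta_0 = (0, 2, 3, 4)$, so the numerator is $4/1$ and the denominator is $25/2$, giving a statistic of 0·32. Converting the statistic to a p-value requires the F distribution, which is not in the standard library and is left out of the sketch.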